For this project, 5 questions are to be explored:
Q1, 2, and 3 require descriptive statistics and appropriate visualization.
Q4 and 5 require inferential statistics and machine learning techniques.
import eda_utility as u
import predictive_utility as p
import pandas as pd
%matplotlib inline
# import datasets
calendar = pd.read_csv('Seattle_data/calendar.csv')
listings = pd.read_csv('Seattle_data/listings.csv')
reviews = pd.read_csv('Seattle_data/reviews.csv')
In this part, I focus on the first 3 questions: price trends, sentiment keywords, and room distribution.
The calendar dataset contains availability and price data for rooms from 01/01/2016 to 01/02/2017. To find the month with the most rooms booked, I grouped the booking counts by month and sorted the grouped dataframe in descending order. The result indicates that, by number of booked rooms, March and December are the busiest times.
# EDA for seattle
# busy times
month_count = u.busiest_time(calendar, 'Seattle')
month_count
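The grouping step can be sketched as follows. Note that `u.busiest_time` is a project helper whose internals are not shown here; this sketch assumes the Seattle calendar schema, where a `date` column and an `available` column (with `'f'` marking a booked night) are present.

```python
import pandas as pd

# Minimal sketch of the month-grouping step. A room counts as "booked"
# when its calendar entry is marked unavailable ('f') -- an assumption
# based on the Seattle Airbnb calendar schema.
calendar = pd.DataFrame({
    'date': ['2016-03-05', '2016-03-12', '2016-12-24', '2016-07-01'],
    'available': ['f', 'f', 'f', 't'],
})
calendar['date'] = pd.to_datetime(calendar['date'])

booked = calendar[calendar['available'] == 'f']
month_count = (booked.groupby(booked['date'].dt.month)
                     .size()
                     .sort_values(ascending=False))
print(month_count)
```

On the real dataset, the index of the first row of this sorted series is the busiest month.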
With the date and price data, I plotted price by date as a line chart to get an overview of how prices change. The plot shows a clear bell-shaped trend and a seasonal pattern with 4 peaks a month. With these features, the series is suitable for further decomposition analysis.
A time series is composed of 3 parts: a trend component, a seasonal component, and a remainder component. There are two methods of seasonal decomposition, additive and multiplicative. Since the magnitude of the seasonal fluctuations does not vary with the level of the time series, I employed the additive method here. The formula is:
y = Trend + Seasonality + Residual
# Time series analysis
# price by date plot
u.time_series_analysis(calendar, 'Seattle')
The 3 components are shown separately in the figure above. We can see a clear increasing trend from January to July and a decreasing trend starting in September. The seasonality might also indicate a weekend effect. To verify this assumption, the average price for each day of the week, by month, is plotted in the figure below. Friday and Saturday see a clear price rise, while prices on the other weekdays remain roughly flat.
# decomposition into trend, seasonality, and residual
u.weekday_decomposition(calendar)
Listening to customers is the most efficient way to detect problems and new opportunities. Since the review dataset provides neither sentiment labels nor rating scores, I employed VADER, a rule-based sentiment analysis model that works well on social media text. The main steps include:
From the sentiment distribution plot below, we can see that positive sentiment dominates the reviews, accounting for around 96% of the total.
# Sentiment Analysis
sentiment_df = u.sentiment_analysis(reviews)
u.sentiment_distribution(sentiment_df)
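The labeling rule applied on top of VADER's compound score can be sketched as follows. The score computation itself (e.g. via a `SentimentIntensityAnalyzer`) is assumed to happen inside `u.sentiment_analysis`; the ±0.05 thresholds below are the conventional VADER cutoffs, and the scores here are illustrative:

```python
# VADER returns a 'compound' score in [-1, 1]; the conventional
# thresholds map it to a discrete sentiment label.
def label_sentiment(compound):
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

scores = [0.83, -0.42, 0.01]  # illustrative compound scores
labels = [label_sentiment(c) for c in scores]
print(labels)  # ['positive', 'negative', 'neutral']
```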
To dive deeper into the keywords of customers' experience with Airbnb in Seattle, I counted the occurrences of each word in the reviews and found the 30 most frequent words. Since the raw texts were preprocessed for analysis, the words here appear in their root form. One thing to note: since many negative reviews were written in languages other than English, that group contains many incomprehensible words.
The word frequency chart below indicates that keywords from satisfied customers include “location”, “clean”, “comfort”, “host”, “neighborhood”, “easy”, “downtown”, etc. For negative reviews, “arrive”, “reservation”, and “cancel” are among the top. There is also some overlap between positive and negative reviews, such as “host”, “apartment”, “location”, “room”, etc.
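The counting step can be sketched with `collections.Counter` on already-preprocessed (lowercased, stemmed, stopword-free) tokens; the tokens below are illustrative, not real review data:

```python
from collections import Counter

# Count token frequencies and take the most common words
# (the real analysis takes the top 30).
tokens = ['location', 'clean', 'host', 'location', 'clean', 'location',
          'comfort', 'host', 'easy', 'downtown']
top_words = Counter(tokens).most_common(3)
print(top_words)
```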
Besides exploring keywords individually, I plotted word clouds that take collocations into account.
From the word clouds below, you can easily capture customers' key points:
Positive: “highly recommended”, “great host”, “walk distance”, “clean”, “comfortable”, “enjoy stay”, etc.
Negative: “auto post”, “day arrived”, “host cancel”, “cancel reservation” etc.
u.sentiment_word_plots(sentiment_df)
Here is a brief overview of how room types are distributed, how price varies among the different types, and which neighbourhoods are the most popular. Feel free to explore the interactive maps below.
# listing distribution visualization
# seattle location
seattle_location = [47.6062, -122.3321]
listing_map = u.listing_distribution_map(listings, seattle_location)
listing_map
neighbourhood_map = u.listing_count_map(listings, seattle_location)
neighbourhood_map
For Q4 and 5, I built a linear regression model based on the listing dataset.
The answer is definitely YES. The numeric "price" variable is the response variable, and some of the remaining variables serve as explanatory variables.
The original listing dataset has 92 columns, including many redundant and uninformative features. Before building the regression model, comprehensive preprocessing and feature engineering are critical.
The main steps include:
The processed dataset ready for modeling has 42 features.
# Predictive modeling for Airbnb 'price' feature
# Clean dataset
df = p.clean_listings(listings)
For the initial model, I plugged in all the features to get an overview of how the model works on the dataset. The performance metrics show that 64.58% of the variance in price can be explained by the model (R-squared = 0.6458). The small mean absolute error (33) alongside the large mean squared error (3035) tells us that some outliers produce huge errors and drive the MSE very high. From the actual-by-predicted value plot, we can see that the regression model performs much better for prices under \$200 than for prices over \$200. The model tends to underpredict expensive rooms.
Overall, the performance of the initial model is acceptable. We can infer that the prices of expensive rooms are more complex, less stable, and influenced by more features than those used for training. For future optimization, I will try more detailed feature engineering on text features, such as room amenities, to see whether additional features improve performance.
# Modeling
# Initial model
X, y = p.initial_model(df, test_size=.30, rand_state=42)
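A minimal sketch of the initial-model step on synthetic data (`p.initial_model` is a project helper whose internals are not shown; the random feature matrix below merely stands in for the 42 engineered listing features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Synthetic stand-in for the engineered listing features
rng = np.random.RandomState(42)
X = rng.rand(500, 5)
y = X @ np.array([120.0, 60.0, 30.0, 15.0, 5.0]) + rng.normal(0, 10, 500)

# Same split settings as the call above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

lm = LinearRegression().fit(X_train, y_train)
preds = lm.predict(X_test)

print('R-squared:', r2_score(y_test, preds))
print('MAE:', mean_absolute_error(y_test, preds))
print('MSE:', mean_squared_error(y_test, preds))
```

The same three metrics (R-squared, MAE, MSE) are the ones reported for the real model above.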
Provided the features are on comparable scales, we can consider columns with larger coefficients as more useful for predicting our response variable. The question then becomes: how large is large enough to decide whether to keep a feature? As mentioned before, if a categorical feature is dominated by one level, its minority levels are rare events and not very indicative. Therefore, I tried several cutoffs (the minimum count of the level of interest in a column) for each categorical feature and removed those whose level of interest appeared fewer times than the cutoff. The evaluation metric for the optimal model is the test-set R-squared.
The results indicate little difference among the experimented models. The optimal model is built with 37 features, improving the test-set R-squared to 64.65%.
# Find the optimal model by changing training features
cutoffs = [3500, 3000, 2500, 2000, 1000, 500, 100, 50, 30, 25]
best_r2_test, best_r2_train, lm_model, X_train, X_test, y_train, y_test = p.find_optimal_lm_model(X, y, cutoffs)
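The cutoff search can be sketched as follows. This is a hypothetical re-implementation of the idea behind `p.find_optimal_lm_model`, run on synthetic data with one common and one rare dummy column; the actual helper's signature and internals differ:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic listing-like data: one numeric feature, one common dummy,
# one rare dummy (its level of interest appears only ~2% of the time).
rng = np.random.RandomState(0)
X = pd.DataFrame({
    'accommodates': rng.randint(1, 8, 400),
    'room_entire': rng.binomial(1, 0.6, 400),
    'room_shared': rng.binomial(1, 0.02, 400),
})
y = 40 * X['accommodates'] + 80 * X['room_entire'] + rng.normal(0, 15, 400)

best_r2, best_cutoff = -np.inf, None
for cutoff in [100, 50, 5]:
    # Drop dummies whose level of interest appears fewer than `cutoff` times
    rare = [c for c in ['room_entire', 'room_shared'] if X[c].sum() < cutoff]
    X_cut = X.drop(columns=rare)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_cut, y, test_size=0.3, random_state=42)
    r2 = r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
    if r2 > best_r2:  # keep the cutoff with the best test-set R-squared
        best_r2, best_cutoff = r2, cutoff
print(best_cutoff, round(best_r2, 4))
```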
For Q5, I would like to identify the factors that contribute most to room price. We can answer this question by looking at each feature's coefficient. The coefficient weight table below shows that host acceptance rate is the most important factor, with a far larger coefficient than the other predictors. The rest of the top 10 factors include the entire-home room type, private room type, bathrooms, bedrooms, non-superhost, location score, accommodates, check-in score, and value score.
The coefficient plots below give a better visualization of both the direction and magnitude of each predictor's effect. Since the host acceptance rate coefficient is far larger than the rest, I excluded it from the plot for readability.
# feature importance
coef_df = p.coef_weights(lm_model, X_train)
coef_df.head(10)
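The likely behavior of `p.coef_weights` can be sketched as follows (an assumption, since the helper's internals are not shown: pair each training column with its fitted coefficient and sort by absolute magnitude):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training data with three illustrative listing features
rng = np.random.RandomState(1)
X_train = pd.DataFrame(rng.rand(200, 3),
                       columns=['bathrooms', 'bedrooms', 'accommodates'])
y_train = (5 * X_train['bathrooms'] + 60 * X_train['bedrooms']
           + 25 * X_train['accommodates'] + rng.normal(0, 1, 200))

lm = LinearRegression().fit(X_train, y_train)

# Pair features with coefficients, rank by absolute magnitude
coef_df = (pd.DataFrame({'feature': X_train.columns, 'coef': lm.coef_})
             .assign(abs_coef=lambda d: d['coef'].abs())
             .sort_values('abs_coef', ascending=False)
             .reset_index(drop=True))
print(coef_df)
```

Sorting by the absolute value (rather than the raw coefficient) keeps strong negative predictors near the top of the ranking.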